Abstract

Part of the Advanced Automation System (AAS) for air-traffic con- 
trol is a protocol to permit flight hand-off from one air-traffic con- 
troller to another. The protocol must be fault-tolerant and, therefore, 
is subtle — an ideal candidate for the application of formal methods. 
This paper describes a formal method for deriving fault-tolerant proto- 
cols that is based on refinement and proof outlines. The AAS hand-off 
protocol was actually derived using this method; that derivation is 
given. 


1 Introduction

The next-generation air traffic control system for the United States is
currently being built under contract to the U.S. government by the IBM
Federal Systems Company (recently acquired by Loral Corp.). The Advanced
Automation System (AAS) [1] is a large distributed system that must
function correctly, even if hardware components fail.

Design errors in AAS software are avoided and eliminated by a host 
of methods. This paper discusses one of them — the formal derivation of a 
protocol from its specification — and how it was applied in the AAS protocol 
for transferring authority to control a flight from one air-traffic controller to 
another. The flight hand-off protocol we describe is the one actually used 
in the production AAS system (although the protocol there is programmed 
in Ada). And, the derivation we give is a description of how the protocol 
actually was first obtained. 

The formal methods we use are neither particularly esoteric nor
sophisticated. The specification of the problem is simple, as is the
characterization of hardware failures that it must tolerate. Because the
hand-off protocol is short, computer-aided support was not necessary for
the derivation. Deriving more complex protocols would certainly benefit
from access to a theorem prover.

We proceed as follows. The next section gives a specification of the 
problem and the assumptions being made about the system. Section 3 
describes the formal method we used. Finally, Section 4 contains our 
derivation of the hand-off protocol. 

2 Specification and System Model 

The air-traffic controller in charge of a flight at any time is determined by 
the location of the flight at that time. However, the hand-off of the flight 
from one controller to another is not automatic: some controller must issue 
a command requesting that the ownership of a flight be transferred from 
its current owner to a new controller. This message is sent to a process that 
is executing on behalf of the new controller. It is this process that starts the 
execution of the hand-off itself. 

The hand-off protocol has the following requirements: 

P1: No two controllers own the same flight at the same time.

P2: The interval during which no controller owns a flight is brief (approx- 
imately one second). 

P3: A controller that does not own a flight knows which controller does
own that flight.

The hand-off protocol is implemented on top of AAS system software
that implements several strong properties about message delivery and
execution time [1]. For our purposes, we simplify the system model somewhat
and mention only those properties needed by our hand-off protocol.

The system is structured as a set of processes running on a collec- 
tion of processors interconnected with redundant networks. The services 
provided by AAS system software include a point-to-point FIFO interpro- 
cess communication facility and a name service that allows for location- 
independent interprocess communication. AAS also supports the notion 
of a resilient process s comprising a primary process s.p and a backup process 
s.b. The primary sends messages to the backup so that the backup's state 
stays consistent with the primary. This allows the backup to take over if 
the primary fails. 

A resilient process is used to implement the services needed by an
air-traffic controller, including screen management, display of radar
information, and processing of flight information. We denote the primary
process for a controller C as C.p and its backup process as C.b. If C is the
owner of a flight f, then C.p can execute commands and send messages
that affect the status of flight f; C.b, like all backup processes in AAS, only
receives and records information from C.p in order to take over if C.p fails.

AAS implements a simple failure model for processes [3]: 

S1: Processes can fail by crashing. A crashed process simply stops
executing without otherwise taking any erroneous action.

S2: If a primary process crashes, then its backup process detects this and 
begins executing a user-specified routine. 

Property S2 is implemented by having a failure detector service. This 
service monitors each process and, upon detecting a failure, notifies any 
interested process. 

If the hand-off protocol runs only for a brief interval of time, then 
it is safe to assume that no more than a single failure will occur during 
execution. So, we assume: 

S3: In any execution of the hand-off protocol, at most one of the partici- 
pating processes can crash. 

S4: Messages in transit can be lost if the sender or receiver of the message
crashes. Otherwise, messages are reliably delivered, without corruption,
and in a timely fashion. No spurious messages are generated.

We also can assume that messages are not lost due to failure of network
components such as controllers and repeaters. This is a reasonable
assumption because the processors of AAS are interconnected with redundant
networks and it is assumed that no more than one of the networks
will fail.

In any long-running system in which processes can fail, there must be a 
mechanism for restarting processes and reintegrating them into the system. 
We ignore such issues here because that functionality is provided by AAS 
system software. Instead, we assume that at the beginning of a hand-off
from A to B, all four processes A.p, A.b, B.p, B.b are operational.

3 Fault-tolerance and Refinement 

A protocol is a program that runs on a collection of one or more processors.
We indicate that S is executed on processor p by writing:

    $\langle S \rangle$ at $p$    (1)

Execution of (1) is the same as skip if p has failed and otherwise is the same
as executing S as a single, indivisible action. This is exactly the behavior
one would expect when trying to execute an atomic action S on a fail-stop
processor.

Sequential composition is indicated by juxtaposition:

    $\langle S_1 \rangle$ at $p_1$  $\langle S_2 \rangle$ at $p_2$    (2)

This statement is executed by first executing $\langle S_1 \rangle$ at $p_1$
and then executing $\langle S_2 \rangle$ at $p_2$. Two points deserve notice.
First, execution of $\langle S_2 \rangle$ at $p_2$ cannot assume that $S_1$ has
actually been performed: if $p_1$ fails before execution of
$\langle S_1 \rangle$ at $p_1$ completes, then the execution of
$\langle S_1 \rangle$ at $p_1$ is equivalent to skip. Second, an actual
implementation of (2) when $p_1$ and $p_2$ are different will require
some form of message exchange in order to enforce the sequencing.

Finally, parallel composition is specified by:

    cobegin $\langle S_1 \rangle$ at $p_1$ || $\langle S_2 \rangle$ at $p_2$ || ... || $\langle S_n \rangle$ at $p_n$ coend    (3)

This statement completes when each component $\langle S_i \rangle$ at $p_i$
has completed. Since some of these components may have been assigned to
processors that fail, all that can be said when (3) completes is that a
subset of the $S_i$ have been performed. If, however, we also know the
maximum number t of failures that can occur while (3) executes, then at
least n - t of the $S_i$ will be performed.
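
The semantics of (1) and the n - t bound for (3) can be illustrated with a small Python sketch. It is not part of the original derivation: the helper names (`at`, `set_flag`, `performed_counts`) are hypothetical, and it makes the simplifying assumption that a processor is either up or crashed for the whole run, so each component either runs atomically or behaves as skip.

```python
from itertools import product

def at(action, proc):
    """Model <S> at p: run S atomically if p is up, otherwise behave as skip."""
    def step(state, up):
        if up[proc]:
            action(state)
    return step

def set_flag(name):
    """Hypothetical component S_i: record that it was performed."""
    def action(state):
        state[name] = True
    return action

procs = ["p1", "p2", "p3"]
steps = [at(set_flag(p), p) for p in procs]

def performed_counts(max_failures):
    """Number of performed components for every pattern with at most max_failures crashes."""
    counts = []
    for pattern in product([True, False], repeat=len(procs)):
        if pattern.count(False) > max_failures:
            continue
        up = dict(zip(procs, pattern))
        state = {}
        # The components are independent, so any execution order gives the same result.
        for step in steps:
            step(state, up)
        counts.append(len(state))
    return counts

print(min(performed_counts(max_failures=1)))  # 2, that is n - t with n = 3 and t = 1
```

With n = 3 components and at most t = 1 crash, every failure pattern leaves at least two components performed, which matches the bound stated above.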

Proof Outlines

We use proof outlines to reason about execution of a protocol. A proof 
outline is a program that has been annotated with assertions, each of which 
is enclosed in braces. A precondition appears before each atomic action, and 
a postcondition appears after each atomic action. Assertions are Boolean 
formulas involving the program variables. Here is an example of a proof 
outline. 

    $\{x = 0 \land y = 0\}$
    X1: $x := x + 1$
    $\{x = 1 \land y = 0\}$
    X2: $y := y + 1$
    $\{x = 1 \land y = 1\}$

In this example, $x = 0 \land y = 0$, $x = 1 \land y = 0$, and
$x = 1 \land y = 1$ are assertions. Assertion $x = 0 \land y = 0$ is the
precondition of X1, denoted pre(X1), and assertion $x = 1 \land y = 0$ is the
postcondition of X1, denoted post(X1). The postcondition of X1 is also the
precondition of X2.

A proof outline is valid if its assertions are an accurate characterization
of the program state as execution proceeds. More precisely, a proof outline
is valid if the proof outline invariant

    $\bigwedge_{S} \big( (at(S) \Rightarrow pre(S)) \land (after(S) \Rightarrow post(S)) \big)$

is not invalidated by execution of the program, where at(S) is a predicate
that is true when the program counter is at statement S, and after(S) is a
predicate that is true when the program counter is just after statement S.

The proof outline above is valid. For example, execution starting in
a state where $x = 1 \land y = 0 \land after(X1)$ is true satisfies the proof
outline invariant and, as execution proceeds, the invariant remains true.
Notice that our definition of validity allows execution to begin anywhere,
even in the middle of the program. Changing post(X1) (and pre(X2)) to x = 1
destroys the validity of the above proof outline. (Start execution in state
$x = 1 \land y = 23 \land after(X1)$. The proof outline invariant will hold
initially but is invalidated by execution of X2.)
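
The validity argument can be replayed mechanically. The following Python sketch is an illustration only, with hypothetical names (`step`, `is_valid`): it encodes the three control points of the example, starts execution at every control point in every state of a small bounded range that satisfies the local assertion, and checks that one step never leaves the proof outline invariant. A bounded search like this can refute validity but does not prove it in general.

```python
# Control points: 0 = at(X1), 1 = after(X1) = at(X2), 2 = after(X2).
def step(pc, x, y):
    if pc == 0:
        return 1, x + 1, y      # X1: x := x + 1
    if pc == 1:
        return 2, x, y + 1      # X2: y := y + 1
    return None                 # program has terminated

def is_valid(assertions, values=range(-2, 25)):
    """True if no step from a state satisfying the invariant reaches a state violating it."""
    for pc in (0, 1):
        for x in values:
            for y in values:
                if not assertions[pc](x, y):
                    continue
                npc, nx, ny = step(pc, x, y)
                if not assertions[npc](nx, ny):
                    return False
    return True

original = [
    lambda x, y: x == 0 and y == 0,
    lambda x, y: x == 1 and y == 0,
    lambda x, y: x == 1 and y == 1,
]
# post(X1) and pre(X2) weakened to x = 1, as discussed in the text.
weakened = [original[0], lambda x, y: x == 1, original[2]]

print(is_valid(original))  # True
print(is_valid(weakened))  # False, e.g. starting from x = 1, y = 23 after X1
```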

A simple set of (syntactic) rules can be used to derive valid proof
outlines. The first such programming logic was proposed in [2]. The logic
that we use is a variant of that one, extended for concurrent programs [4].

Additional extensions are needed to derive a proof outline involving
statements like (1). Here is a rule for (1); it uses the predicate up(p) to
assert that processor p has not failed.

Action at Processor:

    $\dfrac{\{A\}\ S\ \{B\}}{\{A\}\ \langle S \rangle \text{ at } p\ \{up(p) \Rightarrow B\}}$

Since execution of $\langle S \rangle$ at $p$ when p has crashed is equivalent
to a skip, one might think that

    $\{A\}\ \langle S \rangle \text{ at } p\ \{(up(p) \Rightarrow B) \land (\lnot up(p) \Rightarrow A)\}$    (4)

should be valid if $\{A\}\ S\ \{B\}$ is. Proof outline (4), however, is not
valid. Consider an execution that starts in a state satisfying A and suppose
p has not crashed. According to the rule's hypothesis, execution of S would
produce a state satisfying B. If processor p then crashed, the state would
satisfy $\lnot up(p) \land B$. Unless B implies A, the postcondition of (4) no
longer holds.

The problem with (4) is that the proof outline invariant is invalidated
by a processor failure. The predicate up(p) changing value from true to false
causes the proof outline invariant to be falsified. We define a proof outline
to be fault-invariant with respect to a class of failures if the proof outline
invariant is not falsified by the occurrence of any allowable subset of those
failures.

For the hand-off protocol, we are concerned with tolerating a single 
processor failure. We, therefore, are concerned with proof outlines whose 
proof outline invariants are not falsified when up(p) becomes false for a sin- 
gle processor (provided up(p) is initially true for all processors). Checking 
that a proof outline is fault-invariant for this class of failures is simple: 

Fault-Invariance: For each assertion A:

    $(A \land \bigwedge_{p} up(p)) \Rightarrow \bigwedge_{p'} A[up(p') := false]$

where $L[x := e]$ stands for L with every free occurrence of x replaced by e.
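
The Fault-Invariance check lends itself to brute-force enumeration when the assertions are propositional. The sketch below is a hedged illustration, not the authors' tooling: `fault_invariant` is a hypothetical helper, and A and B are treated as independent Boolean propositions (so B does not imply A). It compares the postcondition of the Action at Processor rule with the postcondition of (4).

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def fault_invariant(assertion, procs, variables):
    """Check (A and all processors up) => A[up(p') := false] for each p', by enumeration."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        all_up = {p: True for p in procs}
        if not assertion(all_up, env):
            continue
        for crashed in procs:
            up = dict(all_up)
            up[crashed] = False          # substitute false for up(crashed)
            if not assertion(up, env):
                return False
    return True

# Postcondition of the Action at Processor rule: up(p) => B
rule_post = lambda up, env: implies(up["p"], env["B"])
# Postcondition of (4): (up(p) => B) and (not up(p) => A)
outline4_post = lambda up, env: (implies(up["p"], env["B"])
                                 and implies(not up["p"], env["A"]))

print(fault_invariant(rule_post, ["p"], ["A", "B"]))      # True
print(fault_invariant(outline4_post, ["p"], ["A", "B"]))  # False
```

The failing case for (4) is the one described in the text: B holds, A does not, and p crashes after S has executed.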

4 Derivation of the Hand-off Protocol 

Let CTR(f) be the set of controllers that own flight f. Property P1 can then
be restated as

    P1': $|CTR(f)| \le 1$.

Desired is a protocol Xfer(A, B) satisfying

    $\{A \in CTR(f) \land P1'\}$
    Xfer(A, B)
    $\{B \in CTR(f) \land P1'\}$

such that P1' holds throughout the execution of Xfer(A, B).

A simple implementation of this protocol would be to use a single
variable ctr(f) that contains the identity of the controller of flight f and to
change ctr(f) with an assignment statement:

    $\{A \in ctr(f) \land P1'\}$
    $ctr(f) := (ctr(f) - \{A\}) \cup \{B\}$
    $\{B \in ctr(f) \land P1'\}$

This implementation is problematic because the variable ctr(f) must
reside at some site. Not only does this lead to a possible performance
problem, but it makes determining the owner of f dependent on the
availability of this site. Therefore, we represent CTR(f) with a Boolean
variable C.ctr(f) at each site C, where

    CTR(f): $\{C \mid C.ctr(f)\}$.

By doing so, we now require at least two separate actions in order to
implement Xfer(A, B): one action that changes A.ctr(f) and one action
that changes B.ctr(f). Using the Action at Processor Rule, we get:

    $\{A \in CTR(f) \land P1'\}$
    X1: $\langle A.ctr(f) := false \rangle$ at $A$
    $\{(up(A) \Rightarrow (A \notin CTR(f) \land CTR(f) = \emptyset)) \land P1'\}$

    $\{CTR(f) = \emptyset\}$
    X2: $\langle B.ctr(f) := true \rangle$ at $B$
    $\{(up(B) \Rightarrow B \in CTR(f)) \land P1'\}$

Note that pre(X2) must assert that $CTR(f) = \emptyset$ holds, since otherwise
execution of X2 invalidates P1'.

The preconditions of X1 and X2 are mutually inconsistent, so these
statements cannot be executed in parallel. Moreover, X2 cannot be run first
because pre(X2), $CTR(f) = \emptyset$, does not hold in the initial state.
Thus, X2 must execute after X1. Unfortunately, post(X1) does not imply
pre(X2); if up(A) does not hold, then we cannot assert that
$CTR(f) = \emptyset$ is true. This should not be surprising: if A fails, then
it cannot relinquish ownership.

One solution for this availability problem is to employ a resilient pro- 
cess. That is, each controller C will have a primary process C.p and a backup 
process C.b executed on processors that fail independently. Each process 
has its own copy of C.ctr(f), and these copies will be used to represent 
C.ctr(f) in a manner that tolerates a single processor failure: 


    C.ctr(f): $\begin{cases} C.p.ctr(f) & \text{if } up(C.p) \\ C.b.ctr(f) & \text{if } \lnot up(C.p) \end{cases}$


Since we assume that there is at most one failure during execution of
the protocol, the above definition never references the variable of a failed
process. Replacing references to processor "A" in Statement X1 with "A.p"
produces the following:

    $\{A \in CTR(f) \land P1'\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{(up(A.p) \Rightarrow (A \notin CTR(f) \land CTR(f) = \emptyset)) \land P1'\}$

This proof outline is not fault-invariant, however. If A.p were to fail
when the precondition holds, then the precondition might not continue
to hold. In particular, if $A \in CTR(f)$ holds because A.p.ctr(f) is true and
A.b.ctr(f) happens to be false, then when A.p fails, $A \in CTR(f)$ would not
hold. We need to assert that A.p.ctr(f) = A.b.ctr(f) also holds whenever
pre(X1a) does. We express this condition using the following definition:

    Pr: $(up(A.p) \land up(A.b)) \Rightarrow (A.b.ctr(f) = A.p.ctr(f))$

Note that if one of A.p or A.b has failed then A.p.ctr(f) and A.b.ctr(f) need
not be equal. Adding Pr to pre(X1a) gives the following proof outline,
which is fault-invariant for a single failure:

    $\{A \in CTR(f) \land P1' \land Pr\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{(up(A.p) \Rightarrow (A \notin CTR(f) \land CTR(f) = \emptyset)) \land P1'\}$
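
The role of Pr can be seen in a few lines of Python. This is a minimal sketch under stated assumptions: `ctr`, `pr`, and `owns_survives_primary_crash` are hypothetical names, only controller A is modeled, and the only failure considered is a crash of A.p while A.b stays up.

```python
def ctr(up_p, p_copy, b_copy):
    """A.ctr(f): the primary's copy while the primary is up, else the backup's copy."""
    return p_copy if up_p else b_copy

def pr(up_p, up_b, p_copy, b_copy):
    """Pr: while both processes are up, their copies agree."""
    return (not (up_p and up_b)) or (p_copy == b_copy)

def owns_survives_primary_crash(p_copy, b_copy):
    """Does 'A in CTR(f)' still hold after A.p crashes?"""
    before = ctr(True, p_copy, b_copy)
    after = ctr(False, p_copy, b_copy)
    return (not before) or after

# Without Pr: the primary says true, the backup says false, and ownership vanishes on a crash.
print(owns_survives_primary_crash(True, False))   # False

# With Pr in the precondition, only agreeing copies are possible.
print(all(owns_survives_primary_crash(p, b)
          for p in (False, True) for b in (False, True)
          if pr(True, True, p, b)))               # True
```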


We need more than just X1a to implement X1, however. X1a does not
re-establish Pr, which must hold for subsequent ownership transfers. This
suggests that A.b.ctr(f) also be updated. Another problem with X1a is
that post(X1a) still does not imply pre(X2): if up(A.p) does not hold, then
$CTR(f) = \emptyset$ need not hold.

An action whose postcondition implies

    $up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow CTR(f) = \emptyset))$

suffices. By our assumption of at most one failure, $up(A.p) \lor up(A.b)$
holds, so this postcondition and post(X1a) will together allow us to conclude
that $CTR(f) = \emptyset$ holds, thereby establishing pre(X2). Here is an
action that, when executed in a state satisfying pre(X1a), terminates with
the above assertion holding:

    $\{A \in CTR(f) \land P1' \land Pr\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow CTR(f) = \emptyset))\}$

One might think that since X1a and X1b have the same preconditions they
could be run in parallel, and the design of the first half of the protocol
would be complete. Unfortunately, we are not done yet.

The original protocol specification implicitly restricted permissible
ownership transitions. Written as a regular expression, the allowed sequence
of states is:

    $(CTR(f) = \{A\})^{+}\ (CTR(f) = \emptyset)^{*}\ (CTR(f) = \{B\})^{+}$    (5)

That is, first A owns the flight, then no controller owns the flight for zero
or more states, and finally B owns the flight. The proof outline above does
not tell us anything about transitions; it only tells that P1' holds throughout
(because P1' is implied by all assertions). We must strengthen the proof
outline to deduce that only correct transitions occur.

A regular expression (like the one above) can be represented by a finite
state machine that accepts all sentences described by the regular expression.
Furthermore, a finite state machine is characterized by a next-state
transition function. The following next-state transition function
$\delta_{AB}$ characterizes the finite state machine for (5):

    $\delta_{AB}$: $\begin{cases} \{\{A\}, \emptyset, \{B\}\} & \text{if } A \in CTR(f) \\ \{\emptyset, \{B\}\} & \text{if } CTR(f) = \emptyset \\ \{\{B\}\} & \text{if } B \in CTR(f) \end{cases}$

The value of $\delta_{AB}$ is the set of values of CTR(f) that are next
allowed for the protocol. For example, when $CTR(f) = \emptyset$ holds,
$\delta_{AB}$ says that a transition to a state in which
$CTR(f) = \emptyset$ holds or to a state in which $CTR(f) = \{B\}$ holds
are the only permissible transitions. Note that since P1' holds, we know
that the three cases $A \in CTR(f)$, $CTR(f) = \emptyset$, $B \in CTR(f)$ are
mutually exclusive, so $\delta_{AB}$ always has a unique value.
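
As an illustration, the next-state function can be written directly as a small Python function and used to check traces of CTR(f) values. The names `delta_ab` and `permissible` are hypothetical, and the check covers only individual transitions; it does not require a trace to begin with A as owner or to end with B as owner.

```python
EMPTY, A, B = frozenset(), frozenset({"A"}), frozenset({"B"})

def delta_ab(ctr):
    """Next-state function for (5): the set of values CTR(f) may take next."""
    if "A" in ctr:
        return {A, EMPTY, B}
    if ctr == EMPTY:
        return {EMPTY, B}
    if "B" in ctr:
        return {B}
    raise ValueError("state outside the hand-off from A to B")

def permissible(trace):
    """True if every consecutive pair of states in the trace is allowed by delta_ab."""
    return all(nxt in delta_ab(cur) for cur, nxt in zip(trace, trace[1:]))

print(permissible([A, A, EMPTY, EMPTY, B, B]))  # True
print(permissible([A, B]))                      # True: zero ownerless states are allowed
print(permissible([A, EMPTY, A]))               # False: ownership may not return to A
```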

We further define $\Theta\delta_{AB}$ to be the value of $\delta_{AB}$ in the
previous state during the execution of the hand-off protocol, or
$\{\{A\}, \emptyset, \{B\}\}$ if there is no previous state. Our hand-off
protocol will make only permissible state transitions provided each assertion
implies that $CTR(f) \in \Theta\delta_{AB}$; that is, provided the current
owner of f is one of the owners that was acceptable as the "next owner" in
the previous state of the system.

We therefore add conjunct $CTR(f) \in \Theta\delta_{AB}$ to the assertions in
the proof outline and check to see if the stronger proof outline is valid. If
it is valid, then we can move on to implementing X2, the second part of
Xfer(A, B); otherwise, we will have reason to make further modifications.

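Here is the (strengthened) proof outline with X1a and X1b running in
parallel:

    $\{CTR(f) \in \Theta\delta_{AB} \land A \in CTR(f) \land P1' \land Pr\}$
    cobegin
        $\{CTR(f) \in \Theta\delta_{AB} \land A \in CTR(f) \land P1' \land Pr\}$
        X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
        $\{CTR(f) \in \Theta\delta_{AB} \land (up(A.p) \Rightarrow (A \notin CTR(f) \land CTR(f) = \emptyset)) \land P1'\}$
    ||
        $\{CTR(f) \in \Theta\delta_{AB} \land A \in CTR(f) \land P1' \land Pr\}$
        X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
        $\{CTR(f) \in \Theta\delta_{AB} \land (up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow CTR(f) = \emptyset)))\}$
    coend
    $\{CTR(f) \in \Theta\delta_{AB}$
        $\land\ (up(A.p) \Rightarrow (A \notin CTR(f) \land CTR(f) = \emptyset))$
        $\land\ (up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow CTR(f) = \emptyset)))$
        $\land\ P1' \land Pr\}$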

Unfortunately, this proof outline is not fault-invariant. If A.p fails in a
state satisfying $after(X1a) \land at(X1b)$ then the following holds before
the failure:

    $after(X1a) \land at(X1b) \land up(A.p) \land up(A.b) \land (CTR(f) = \emptyset)$
        $\land\ \lnot A.p.ctr(f) \land A.b.ctr(f) \land \Theta\delta_{AB} = \{\emptyset, \{B\}\}$

After the failure, we have:

    $after(X1a) \land at(X1b) \land \lnot up(A.p) \land up(A.b) \land (A \in CTR(f))$
        $\land\ \lnot A.p.ctr(f) \land A.b.ctr(f) \land \Theta\delta_{AB} = \{\emptyset, \{B\}\}$

So, $CTR(f) \in \Theta\delta_{AB}$ does not hold after the failure, and the
first conjunct of post(X1a) is invalidated. One simple solution is to preclude
states where $at(X1b) \land after(X1a)$ holds. This can be done by running the
two actions in sequence: first X1b and then X1a. The result is described by
the following proof outline:


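    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land A \in CTR(f)\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land A.p.ctr(f) \land (up(A.b) \Rightarrow \lnot A.b.ctr(f))\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land (up(A.b) \Rightarrow \lnot A.b.ctr(f))$
        $\land\ (up(A.p) \Rightarrow \lnot A.p.ctr(f)) \land Pr\}$
    therefore, according to the definitions of CTR(f) and A.ctr(f),
    $\{CTR(f) \in \Theta\delta_{AB} \land A \notin CTR(f) \land P1' \land Pr\}$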

What we really want to conclude in post(X1a), however, is
$CTR(f) = \emptyset$, not just $A \notin CTR(f)$. This is easily done by
strengthening the above proof outline with the following:

    POnly(A): For all controllers C: $C \ne A$: $C \notin CTR(f)$

POnly(A) is initially true because $A \in CTR(f) \land P1'$ holds. It is not
invalidated by any assignment, because the only variables assigned to are
those of A.p and A.b. So, POnly(A) remains true throughout the execution
of X1.
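
The difference between the two orderings can be checked exhaustively for controller A. The sketch below is an illustration with hypothetical names, not the derivation itself: it models only A.p and A.b, allows at most one crash (assumption S3) at any point in the sequence, and flags any transition in which CTR(f) goes from empty back to {A}. The parallel composition admits both orderings, so a violation in either one rules it out.

```python
def owns(s):
    """A in CTR(f), using the resilient representation of A.ctr(f)."""
    return s["p_ctr"] if s["up_p"] else s["b_ctr"]

def x1a(s):
    if s["up_p"]:               # <A.p.ctr(f) := false> at A.p
        s["p_ctr"] = False

def x1b(s):
    if s["up_b"]:               # <A.b.ctr(f) := false> at A.b
        s["b_ctr"] = False

def crash(proc):
    def event(s):
        s["up_" + proc] = False
    return event

def only_permissible(order):
    """Try every placement of at most one crash; reject any empty -> {A} transition."""
    schedules = [list(order)]
    for proc in ("p", "b"):
        for i in range(len(order) + 1):
            schedules.append(list(order[:i]) + [crash(proc)] + list(order[i:]))
    for schedule in schedules:
        s = {"up_p": True, "up_b": True, "p_ctr": True, "b_ctr": True}
        prev = owns(s)
        for event in schedule:
            event(s)
            cur = owns(s)
            if cur and not prev:    # CTR(f) went from empty back to {A}
                return False
            prev = cur
    return True

print(only_permissible([x1b, x1a]))  # True
print(only_permissible([x1a, x1b]))  # False: A.p crashes after X1a, before X1b
```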

The derivation of a protocol for X2 is basically the same, except that A
is replaced by B and false is replaced by true. Doing so results in the
proof outline shown in Figure 1.

    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(A) \land A \in CTR(f)\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land POnly(A) \land A.p.ctr(f) \land B \notin CTR(f)$
        $\land\ (up(A.b) \Rightarrow \lnot A.b.ctr(f))\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(A) \land B \notin CTR(f)$
        $\land\ (up(A.b) \Rightarrow \lnot A.b.ctr(f)) \land (up(A.p) \Rightarrow \lnot A.p.ctr(f))\}$
    therefore, according to the definitions of CTR(f), A.ctr(f), and POnly(B)
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B) \land CTR(f) = \emptyset\}$
    X2b: $\langle B.b.ctr(f) := true \rangle$ at $B.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land POnly(B) \land \lnot B.p.ctr(f)$
        $\land\ (up(B.b) \Rightarrow B.b.ctr(f))\}$
    X2a: $\langle B.p.ctr(f) := true \rangle$ at $B.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B)$
        $\land\ (up(B.b) \Rightarrow B.b.ctr(f)) \land (up(B.p) \Rightarrow B.p.ctr(f))\}$
    therefore, according to the definitions of CTR(f) and B.ctr(f),
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B) \land B \in CTR(f)\}$

Figure 1: Hand-off Protocol for A and B

4.1 Implementing P3

Property P3 of Section 2 is satisfied by the protocol in Figure 1 as long as
there are exactly two controllers. When there are more than two controllers,
a controller must query other controllers in order to determine which owns
a flight. Doing so is inefficient, so we instead consider having each
controller C maintain a variable C.ctrID(f) that names the owner of flight f.
As with C.ctr(f), we represent the value of C.ctrID(f) in a manner that
tolerates a single site failure:


    C.ctrID(f): $\begin{cases} C.p.ctrID(f) & \text{if } up(C.p) \\ C.b.ctrID(f) & \text{if } \lnot up(C.p) \end{cases}$    (6)


This variable can be used to implement the Boolean C.ctr(f) by defining:

    C.ctr(f): (C.ctrID(f) = C)

Thus, the assignment "C.ctr(f) := true" would be replaced by
"C.ctrID(f) := C", and "C.ctr(f) := false" would be replaced by
"C.ctrID(f) := X" for any value $X \ne C$.

We can rewrite P3 as the following:

    P3': $(\exists C: C.ctrID(f) = C) \Rightarrow (\exists C: (C.ctrID(f) = C) \land (\forall C': C'.ctrID(f) = C))$.

For the protocol of Figure 1, P3' holds when at(X2b) is true because the
antecedent is false. Furthermore, if we explicitly assign A.ctrID(f) := B
as the assignments X1b and X1a, then P3' holds throughout the execution,
provided C' ranges over the set {A, B}. For the other controllers, additional
statements are needed, shown in Figure 2.

Since Z(A, B) in Figure 2 changes the values of C.ctrID(f), it should
be executed when $CTR(f) = \emptyset$ holds, because otherwise its execution
may violate P3'. Thus, Z(A, B) would have to be started no earlier than
after(X1a) and terminate by at(X2a). Unfortunately, Z(A, B) may take a
significant amount of time: even though its component statements can be
executed in parallel, the time to execute Z(A, B) will include some
communication and synchronization overhead. This extra time could make
satisfying P2 hard or impossible.

Property P3' is perhaps a bit too strong. In fact, all that is really required
is that a controller be able to communicate with the process that owns a
flight. For example, C.ctrID(f) could be the start of a path of controllers,
terminating with the current owner. The scheme where C.ctrID(f) indicates
the current owner is equivalent to requiring that this path have a length of
1. But, longer paths are also acceptable.
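
A lookup that tolerates longer paths can be sketched as follows. This is an assumption-laden illustration rather than AAS code: `find_owner` and the snapshot dictionary are hypothetical, and each controller's ctrID(f) is taken to be readable directly instead of being obtained by message exchange.

```python
def find_owner(start, ctr_id):
    """Follow ctrID(f) pointers from `start` until reaching a controller that names itself."""
    current = start
    # A path that visits more controllers than exist must contain a cycle.
    for _ in range(len(ctr_id) + 1):
        nxt = ctr_id[current]
        if nxt == current:
            return current
        current = nxt
    return None  # no owner reachable, e.g. during the ownerless interval

# Hypothetical snapshot after a hand-off from A to B:
# C still points at the old owner A, D already points at B.
ctr_id = {"A": "B", "B": "B", "C": "A", "D": "B"}
print(find_owner("C", ctr_id))  # B, reached through a path of length 2
print(find_owner("D", ctr_id))  # B, reached through a path of length 1
```

The path from C has length 2 because C has not yet been told about the hand-off; the lookup still ends at the current owner.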

    Z(A, B):
    {true}
    cobegin
        Zp: $\|_{C:\, C \notin \{A, B\}}$
            {true}
            $\langle C.p.ctrID(f) := B \rangle$ at $C.p$
            $\{up(C.p) \Rightarrow (C.p.ctrID(f) = B)\}$
    ||
        Zb: $\|_{C:\, C \notin \{A, B\}}$
            {true}
            $\langle C.b.ctrID(f) := B \rangle$ at $C.b$
            $\{up(C.b) \Rightarrow (C.b.ctrID(f) = B)\}$
    coend
    $\{(up(C.p) \Rightarrow (C.p.ctrID(f) = B)) \land (up(C.b) \Rightarrow (C.b.ctrID(f) = B))\}$
    therefore, according to the definitions of C.ctrID(f)
    $\{\forall C \notin \{A, B\}: C \to B\}$

Figure 2: Hand-off Protocol for Controllers other than A and B

Let $C \to C'$ denote that C.ctrID(f) = C', and let $C \stackrel{+}{\to} C'$
denote the transitive closure of $\to$. Using this notation, P3' can be
expressed as:

    P3': $(\exists C: C \to C) \Rightarrow (\exists C: C \to C \land (\forall C': C' \to C))$.

We weaken P3' as follows:

    P3'': $(\exists C: C \to C) \Rightarrow (\exists C: C \to C \land (\forall C': C' \stackrel{+}{\to} C))$.

P3'' is left invariant by the protocol in Figure 1. P3'' is also an invariant
of the protocol of Figure 2 provided $B \to B \lor B \to A$ initially holds.
From post(Z(A, B)) and post(X2a), we conclude that as long as the execution
of Z(A, B) completes before another hand-off starts, P3' will hold once
Z(A, B) and the protocol in Figure 1 have both terminated. Since P3' implies
$B \to B \lor B \to A$, the system is once again in a state from which a
hand-off can be performed. Hence, Z(A, B) can begin executing at any point
during the hand-off from A to B, because its precondition,
$B \to B \lor B \to A$, holds throughout the protocol of Figure 1. And,
Z(A, B) must complete before a subsequent hand-off has started.

4.2 Implementation using Messages 

So far, the protocol we have derived consists of assignment statements to
various variables that reside on separate processes. The protocol consists
of three processes, as follows:

    cobegin
        X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
        X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
        X2b: $\langle B.b.ctr(f) := true \rangle$ at $B.b$
        X2a: $\langle B.p.ctr(f) := true \rangle$ at $B.p$
    ||
        Zp: $\|_{C:\, C \notin \{A, B\}}$ $\langle C.p.ctrID(f) := B \rangle$ at $C.p$
    ||
        Zb: $\|_{C:\, C \notin \{A, B\}}$ $\langle C.b.ctrID(f) := B \rangle$ at $C.b$
    coend

An actual implementation would require that each assignment statement
be executed by the processor whose variable is being set. Furthermore,
the assignment statements of the first process must be sequenced.
This sequencing will be accomplished in our implementation by processor
B.p, since this processor starts the protocol. If B.p crashes, then B.b will
take over the sequencing. Because all assignments are of constants to
variables, when taking over, B.b can simply start at the beginning of the
sequence; it need not know how far B.p got before failing.

B.b does need to know when B.p has finished executing the hand-off
protocol. Otherwise, a crash of B.p might cause B.b to re-execute the
hand-off from A to B after f has been later handed off to another controller,
in which case B.b would undo that later hand-off. Hence, B.b must be
notified of the completion of the hand-off before any subsequent hand-offs
are started. We represent the fact that a hand-off from A to B is in progress
with a variable B.b.xfr, whose value is initially $\bot$.
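
The two observations above (replaying constant assignments is harmless, but replaying after a later hand-off is not) can be sketched in Python. Everything here is a simplified, hypothetical model: the dictionary `flags` stands for the four ctr(f) copies, `BOTTOM` stands for the initial value of xfr, and message passing is ignored.

```python
BOTTOM = None  # stands for the "no hand-off in progress" value of xfr

def run_handoff(flags, old, new):
    """The four assignments in protocol order; each writes a constant, so replay is harmless."""
    flags[old + ".b"] = False   # X1b
    flags[old + ".p"] = False   # X1a
    flags[new + ".b"] = True    # X2b
    flags[new + ".p"] = True    # X2a

def backup_takeover(flags, xfr, me):
    """On a primary crash, redo the hand-off only if one is recorded as in progress."""
    if xfr is not BOTTOM:
        _flight, old = xfr
        run_handoff(flags, old, me)

# B.p crashed midway through A -> B: X1b and X1a ran, X2b and X2a did not.
flags = {"A.p": False, "A.b": False, "B.p": False, "B.b": False}
backup_takeover(flags, ("f", "A"), "B")
print(flags["B.b"])                 # True: B.b completed the hand-off from the start

# Later, f has moved on from B to C and xfr was reset; a takeover must leave that alone.
flags = {"A.p": False, "A.b": False, "B.p": False, "B.b": False,
         "C.p": True, "C.b": True}
backup_takeover(flags, BOTTOM, "B")
print(flags["B.b"], flags["C.p"])   # False True
```

If xfr were still set to (f, A) in the second scenario, the takeover would mark B as owner while C also owns f, which is exactly the violation of P1 that resetting xfr prevents.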

In order to continue the implementation using messages, some further 
details of the AAS system services must be given. 

• Communication between resilient processes uses send and receive.
If some process sends a message m to a resilient process C, then m is
enqueued at C.p if C.p has not crashed and enqueued at C.b if C.p has
crashed. Furthermore, send does not return control until the message
has been enqueued at the remote process. The remote process may
crash after enqueueing m but before delivering m, in which case m is
lost.

• The primary of a resilient process communicates with its backup
using log. Like send, log does not return control until the message is
enqueued by the remote process. A log that is executed when there is
no backup (for example, when the backup has crashed or when log is
executed by the backup itself) does nothing and immediately returns
control.

• Until the primary of a resilient process crashes, the backup delivers
only messages sent by log.

• When primary C.p crashes, C.b takes over by first processing any
enqueued messages sent by C.p using log. It then executes the
user-defined recovery protocol. And, finally, it receives messages sent to C.

We also use a variable in each process to represent the value of variables
C.p.ctrID(f) and C.b.ctrID(f). A simple approach would be to introduce
C.p.owner(f) and C.b.owner(f), such that:

    C.p.ctrID(f): C.p.owner(f)

    C.b.ctrID(f): C.b.owner(f).

Doing so, however, is inefficient (as well as difficult given the AAS
communication primitives). Consider X1b in the hand-off protocol. To
implement X1b, B.p would send a message to A.b instructing it to execute
A.b.owner(f) := B. Since X1b must complete before X1a starts, B.p cannot
start X1a before A.b completes its assignment. The result is two end-to-end
message delays.

A more efficient hand-off protocol can be implemented using the following
definitions of C.p.ctrID(f) and C.b.ctrID(f). Let the predicate $E_C(f, X)$
mean that C.b has enqueued but not yet processed a log from C.p that
requests the execution of C.b.owner(f) := X, and let $V_C(f)$ be the value
of X in the most recent such log message. Then, we define:

    C.p.ctrID(f): C.p.owner(f)

    C.b.ctrID(f): $\begin{cases} C.b.owner(f) & \text{if } (\forall X: \lnot E_C(f, X)) \\ V_C(f) & \text{if } (\exists X: E_C(f, X)) \end{cases}$

B.p can cause the execution of X1b followed by X1a simply by sending
a single message to A requesting execution of owner(f) := B. Upon delivery
of this message, A.p first executes a log so A.b learns of the message. Since
log does not return until $E_A(f, B)$ holds, post(X1b) holds when log returns.
A.p can then establish post(X1a) by executing A.p.owner(f) := B.
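
The effect of defining C.b.ctrID(f) in terms of enqueued log messages can be illustrated with a small Python model. The class `Backup` and its methods are hypothetical; the queue holds only the values X of pending "owner(f) := X" log messages for a single flight, and enqueueing is assumed to be the point at which log returns to the primary.

```python
from collections import deque

class Backup:
    """Sketch of C.b: owner(f) plus a queue of enqueued, not yet processed log messages."""

    def __init__(self, owner):
        self.owner = owner
        self.queue = deque()

    def enqueue_log(self, new_owner):
        # log returns control to the primary once the message is enqueued here
        self.queue.append(new_owner)

    def process_one(self):
        if self.queue:
            self.owner = self.queue.popleft()

    def ctr_id(self):
        """C.b.ctrID(f): the most recently enqueued value if a log is pending, else owner(f)."""
        return self.queue[-1] if self.queue else self.owner

a_b = Backup(owner="A")
a_b.enqueue_log("B")             # A.p executed log "owner(f) := B"
print(a_b.owner)                 # A: the assignment itself has not run yet
print(a_b.ctr_id())              # B: the abstract variable has already changed
a_b.process_one()
print(a_b.owner, a_b.ctr_id())   # B B
```

Under this reading, X1b takes effect as soon as the log message is enqueued, so A.p does not have to wait for A.b to process it.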


    cobegin
        {pre(X1b)}
        $\langle$log "xfr := (f, A)"$\rangle$ at B.p
        $\langle$send "owner(f) := B" to A$\rangle$ at B.p
        $\langle$wait for "ack" from A$\rangle$ at B.p
        {post(X1a) $\land$ pre(X2b)}
        $\langle$log "owner(f) := B"$\rangle$ at B.p
        {post(X2b) $\land$ pre(X2a)}
        $\langle$B.p.owner(f) := B$\rangle$ at B.p
        {post(X2a)}
        $\langle \forall C: C \notin \{A, B\}$: send "owner(f) := B" to C$\rangle$ at B.p
        $\langle$log "xfr := $\bot$"$\rangle$ at B.p
    $\|_{C:\, C \ne B}$
        $\langle$when deliver "owner(f) := X"$\rangle$ at C.p
        $\langle$log "owner(f) := X"$\rangle$ at C.p
        {(C = A) $\Rightarrow$ post(X1b) $\land$ (C $\ne$ A) $\Rightarrow$ post(Zb)}
        $\langle$C.p.owner(f) := X$\rangle$ at C.p
        {(C = A) $\Rightarrow$ post(X1a) $\land$ (C $\ne$ A) $\Rightarrow$ (post(Zp) $\land$ post(Zb))}
        $\langle$send "ack" to X$\rangle$ at C.p
    $\|_{C}$
        $\langle$when deliver "x := v" from C.p do C.b.x := v$\rangle$ at C.b
    $\|_{C}$
        $\langle$when C.p fails do
            if xfr = (f, X)
            then start hand-off of f from X to C$\rangle$ at C.b
    coend

Figure 3: Complete Hand-off Protocol

The complete hand-off protocol is shown in Figure 3. The assertions in
the code refer to Figures 1 and 2.